Introduction to Arabic


Natural Language Processing 


Nizar Habash 

Columbia University 

Center for Computational Learning Systems 




Focus of this tutorial 

- Phenomena 

- Concepts 

- Approaches & Resources 

What is ‘Arabic’? 

- Arabic Script 

- Arabic Language 

• Modern Standard Arabic (MSA)

• Arabic Dialects 



Road Map 


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 
Road Map


Introduction 

Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

Morphology 

Syntax 

Machine Translation Issues 
Dialects 


Arabic Script 


[Figure: development of the alphabet from Phoenician. One branch runs through Greek and Early Latin to the Modern Roman letters (A B G D E F Z H I K L M N O P Q R S T); the other runs through Early Aramaic and Nabataean to Arabic. © Mamoun Sakkal 1997]
Arabic Script


Arabic script is an alphabet with allographic variants,
optional zero-width diacritics and common ligatures.



Arabic script is used to write many languages: Arabic,
Persian, Kurdish, Urdu, Pashto, etc.



Arabic Script 


Alphabet

• letter forms

• letter marks

  • Arabic only

  • Other languages: Persian, Kurdish, Urdu, Pashto, etc.

• OCR output ambiguity

[The sample letter forms and marks shown on the slide were lost in extraction]





Arabic Script 


Alphabet (MSA)


• letters (form+mark)

• Distinctive marks

ب /b/   ت /t/   ث /θ/

س /s/   ش /ʃ/

• Non-distinctive marks

the hamza forms all spell /ʔ/


glottal stop aka hamza
Arabic Script


Letter Shapes 

• No distinction between print and handwriting 

• No capitalization 

• Right-to-left 

• Ambiguous shapes

• Connective letters

• Disconnective letters

[Table: sample letters shown in their stand-alone, initial, medial and final shapes]

Arabic Script 


Letter shaping

كتب = ك + ت + ب (k t b)

/katab/

to write

كتاب = ك + ت + ا + ب (k t ā b)

/kitāb/

book
Arabic Script

Diacritics

• Zero-width characters

• Used for short vowels

كَتَب /katab/ to write

• Nunation is used as the nominal indefinite marker in MSA

كِتابٌ /kitābun/ a book


Vowel        Nunation

بَ /ba/       بً /ban/

بُ /bu/       بٌ /bun/

بِ /bi/       بٍ /bin/
Arabic Script


Diacritics


• No-vowel marker (sukun)

مَكْتَب /maktab/ office

• Double consonant marker (shadda)

كَتَّب /kattab/ to dictate


• Combinable

بُّ /bbu/   بٍّ /bbin/   بًّ /bban/
Arabic Script


Putting it together

Simple combination

Arab /ʕarab/: عَرَب = عَ + رَ + ب

West /ɣarb/: غَرْب = غَ + رْ + ب

Ligatures

Peace /salām/: سلام, where ل + ا join as the ligature لا

Arabic Script 


Tatweel

• 'elongation'

• aka kashida

• used for text highlight and justification

human rights /ħuqūq alʔinsān/: حقوق الإنسان, elongated as حقـــوق الإنســـان
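Because diacritics and tatweel are optional, zero-width or purely decorative characters, many pipelines strip them before comparing words. The following is a minimal sketch of that step, assuming the text is already Unicode; the function names are illustrative and not part of the tutorial.

```python
import re

# Nunation, short vowels, shadda and sukun occupy U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # the elongation character (kashida)


def strip_diacritics(text: str) -> str:
    """Remove the optional zero-width vowel and gemination marks."""
    return DIACRITICS.sub("", text)


def strip_tatweel(text: str) -> str:
    """Remove elongation characters, which carry no linguistic content."""
    return text.replace(TATWEEL, "")


# kaf + fatha, teh + shadda + fatha, beh: the diacritized spelling of /kattab/
kattab = "\u0643\u064E\u062A\u0651\u064E\u0628"
bare = strip_diacritics(kattab)  # kaf, teh, beh only
```

After stripping, /katab/ and /kattab/ collapse to the same three letters, which is the ambiguity discussed in the spelling slides.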
Arabic Script


• Different styles

• High fluidity

• Optional ligatures

• Vertical arrangements


[Figure: calligraphic renderings of three words: Arabic /ʕarabi/, Muhammad /muħammad/, algebra /aldʒabr/]
Arabic Script

"Arabic" Numerals

• Decimal system

• Numbers written left-to-right in right-to-left text

Algeria achieved its independence in 1962 after 132 years of French occupation.
(In the Arabic sentence on the slide, 1962 and 132 keep their left-to-right digit order.)

• Three systems of enumeration symbols that vary by region


Western Arabic (Tunisia, Morocco, etc.)       0 1 2 3 4 5 6 7 8 9

Indo-Arabic (Middle East)                     ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩

Eastern Indo-Arabic (Iran, Pakistan, etc.)    ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
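Since the three digit systems denote the same decimal values, text is often normalized to a single set before further processing. The sketch below maps both Indo-Arabic ranges to Western digits; the helper name `normalize_digits` is an assumption for illustration.

```python
WESTERN = "0123456789"
# Indo-Arabic digits are U+0660..U+0669; the Eastern variants are U+06F0..U+06F9.
INDO_ARABIC = "".join(chr(0x0660 + i) for i in range(10))
EASTERN_INDO_ARABIC = "".join(chr(0x06F0 + i) for i in range(10))

# One translation table covering both non-Western digit sets.
TO_WESTERN = str.maketrans(INDO_ARABIC + EASTERN_INDO_ARABIC, WESTERN * 2)


def normalize_digits(text: str) -> str:
    """Rewrite Indo-Arabic and Eastern Indo-Arabic digits as 0-9."""
    return text.translate(TO_WESTERN)


year = normalize_digits("\u0661\u0669\u0666\u0662")   # "1962"
span = normalize_digits("\u06F1\u06F3\u06F2")         # "132"
```

Digits are stored most-significant first in all three systems, so no reordering is needed.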




MSA Phonology and Spelling 


• Phonological profile of Standard Arabic 

- 28 Consonants 

- 3 short vowels, 3 long vowels, 2 diphthongs 

• Arabic spelling is mostly phonemic ...

- Letter-sound correspondence

[Chart: each Arabic letter paired with its phonemic value]
MSA Phonology and Spelling


• Arabic spelling is mostly phonemic ...

Except for

• Medial short vowels can only appear as diacritics

• Diacritics are optional in most written text

- Except in holy scripture

- When present, diacritics mark syntactic/semantic distinctions

• /katab/ to write vs. /kutib/ to be written

• /ħubb/ love vs. /ħabb/ seed

• Dual use of ا, و, ي as consonant and long vowel

- ا (/ʔ/, /ā/)   و (/w/, /ū/)   ي (/j/, /ī/)
MSA Phonology and Spelling


• Arabic spelling is mostly phonemic ...

Except for (continued)

• Morphophonemic characters

- Feminine marker ة (ta marbuta)

• كبير /kabīr/ (big, masculine) → كبيرة /kabīra/ (big, feminine)

- Derivation marker

• عصى /ʕaṣā/ (to disobey) vs. عصا /ʕaṣā/ (a stick)

• Hamza variants (6 characters for one phoneme!)

- e.g. /bahāʔ/ + 3MascSing (his glory): بهاؤه, بهاءه, بهائه
MSA Phonology and Spelling


• Arabic spelling can be ambiguous

- optional diacritics and dual use of letters

• But how ambiguous? Really?

• Classic example

ths s wht n rbc txt lks lk wth n vwls

this is what an Arabic text looks like with no vowels

• Not exactly true

- Long vowels are always written

- Initial vowels are represented by an ا (alef)

- Some final short vowels are represented

ths is wht an Arbc txt lks lik wth no vwls

Will revisit ambiguity in more detail under the morphology discussion


Arabic Script
Other languages


Arabic

• No more than 3 dots

• Dots either above or below

• Marks are 1 / 2 / 3 dots, hamza (ء) or madda (~) only

• Rare borrowing for foreign words

• e.g. پ /p/, ڤ /v/, گ /g/, چ /tʃ/

• regionally variable



Not Arabic

• Extra marks: haček (v), ring (o), small ṭā, four dots, vertical dots (:)

• Some Numerals



Once you learn the alphabet, it is easier.
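The rules of thumb above can be approximated in code by counting Arabic-script letters that fall outside the core Arabic repertoire. This is a rough heuristic sketch, not a language identifier: the code point ranges are taken from the Unicode Arabic block, while the 5% threshold and the function names are assumptions chosen for illustration.

```python
def _is_core_arabic(ch: str) -> bool:
    """Letters used to write Arabic itself: U+0621..U+063A and U+0641..U+064A."""
    cp = ord(ch)
    return 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A


def _is_extended(ch: str) -> bool:
    """Extended letters (U+0671..U+06D3) used mostly by other languages."""
    return 0x0671 <= ord(ch) <= 0x06D3


def extended_ratio(text: str) -> float:
    """Share of Arabic-script letters that are extended letters."""
    core = sum(_is_core_arabic(c) for c in text)
    ext = sum(_is_extended(c) for c in text)
    total = core + ext
    return ext / total if total else 0.0


def guess_language(text: str, threshold: float = 0.05) -> str:
    """Very rough guess; the threshold is an arbitrary illustrative value."""
    if not any(_is_core_arabic(c) or _is_extended(c) for c in text):
        return "no Arabic script"
    return "Arabic" if extended_ratio(text) <= threshold else "not Arabic"


arabic_word = "\u0643\u062A\u0627\u0628"  # kaf teh alef beh
# Persian sample containing peh (U+067E), Farsi yeh (U+06CC) and gaf (U+06AF)
persian_phrase = "\u067E\u0627\u0631\u0633\u06CC \u06AF\u0641\u062A\u0646"
```

A small tolerance is kept because, as noted above, Arabic text occasionally borrows letters such as پ for foreign words.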
[Exercise: three scanned text samples, each followed by the choice below]

□ Arabic

□ Not Arabic



Encoding Issues 


Encoding Arabic 

- Data entry, storage, and display 

- Ease of use for Arabic-illiterate users 

- Multi-script support 

- Multilingual support (extended Arabic characters) 

Types of Encoding 

- Machine character sets 

• Graphemic (shape insensitive, logical order) 

• Allographs (shape/direction sensitive) [obsolete] 

- Human accessible 

• Transliteration 

• Phonetic spelling (IPA) 

• Romanization 



Encoding Issues 

• Many Conflicting Character Sets for Arabic 





Encodings

• CP-1256

- Commonly used

- 1-byte characters

- Widely supported input/display

- Minimal support for extended Arabic characters

- Bi-script support (Roman/Arabic)

- Tri-lingual support: Arabic, French, English (à la ANSI)


Codepage 1256 - Arabic Windows

[Table: 16×16 code chart, rows 0- to F- by columns -0 to -F, giving the glyph and Unicode code point for each byte. Rows 2- to 7- match ASCII; the Arabic letters (U+0621 to U+064A) and diacritics (U+064B to U+0652) fall in rows C- to F-, interleaved with accented Latin letters used in French.]



Encodings


• Unicode

- Increasingly the standard

- 2-byte characters (Arabic letters in UTF-8 and UTF-16)

- Widely supported input/display

- Supports extended Arabic characters

- Multi-script representation
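The size and coverage differences between CP-1256 and Unicode can be checked directly with Python's built-in codecs. This is a small sketch; the helper `fits_cp1256` is an illustrative name, and the sample word is an assumption.

```python
word = "\u0643\u062A\u0627\u0628"  # kaf teh alef beh, 4 letters

as_cp1256 = word.encode("cp1256")     # one byte per letter
as_utf8 = word.encode("utf-8")        # two bytes per Arabic letter
as_utf16 = word.encode("utf-16-le")   # two bytes per Arabic letter

sizes = (len(as_cp1256), len(as_utf8), len(as_utf16))  # (4, 8, 8)


def fits_cp1256(text: str) -> bool:
    """True if every character has a slot in the single-byte code page."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False


# Farsi yeh (U+06CC) is an extended letter with no CP-1256 slot,
# while peh (U+067E) is one of the few extended letters it does include.
farsi_yeh_ok = fits_cp1256("\u06CC")
peh_ok = fits_cp1256("\u067E")
```

The result illustrates the "minimal support for extended Arabic characters" point: a handful of extended letters fit in the code page, but not the full repertoire.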
Encodings


• Unicode

- Supports presentation forms (shapes and ligatures)

[Chart: Arabic Presentation Forms-B, U+FE70 to U+FEFF]

[Chart: Arabic Presentation Forms-A, excerpt U+FC40 to U+FD1F]
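Text that was stored with presentation forms (one code point per letter shape) can be folded back to the graphemic, shape-insensitive letters with Unicode compatibility normalization. A minimal sketch using the standard `unicodedata` module; the sample strings are assumptions.

```python
import unicodedata

# kaf initial form, teh medial form, alef final form, beh isolated form
shaped = "\uFEDB\uFE98\uFE8E\uFE8F"

# NFKC replaces each presentation form with its base letter.
logical = unicodedata.normalize("NFKC", shaped)

# The lam-alef ligature (U+FEFB) decomposes into two letters: lam + alef.
lam_alef = unicodedata.normalize("NFKC", "\uFEFB")
```

After normalization the four shaped code points become the plain letters U+0643, U+062A, U+0627 and U+0628, so the word matches the same spelling typed on a graphemic keyboard.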
Encoding Issues

Arabic Display

• Memory (logical order)

[Example: an Arabic sentence with the inline glosses (Palestine) and (Olympics) and the years 2000 and 2004, shown as stored, with unshaped letters. "Or this way for those with direction-bias": the same string fully reversed, ".4002 ... 0002 ) scipmylO ( ... ) enitselaP ( ..."]

• Display (visual order)

- Bidirectional (BiDi) support

• Numbers and Roman script

[Example: the sentence in visual order, ".2004 ... 2000 (Olympics) ... (Palestine) ...", with numbers and Roman script kept left-to-right]

- Letter and ligature shaping

[Example: the same line with letters joined and ligatures applied]
Display Problems

[Table: the same Arabic text stored in each actual encoding (rows: CP-1256, ISO-8859, Unicode) and viewed under each display encoding (columns: CP-1256, ISO-8859, Unicode, Western). Only the matching pairs are readable; every other cell shows garbled characters.]

Wrong encoding

Partial support problems
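The "wrong encoding" cells can be reproduced by decoding stored bytes with a codec other than the one that wrote them. The sketch below stores a word as CP-1256 and views it as Western (Latin-1); the sample word is an assumption.

```python
text = "\u0633\u0644\u0627\u0645"  # seen lam alef meem

stored = text.encode("cp1256")   # actual encoding: CP-1256

# Display encoding: Western. Every byte is valid Latin-1, so nothing fails;
# the reader simply sees accented Latin letters instead of Arabic.
garbled = stored.decode("latin-1")

# Because Latin-1 maps bytes to characters one-to-one, the damage is
# reversible as long as the garbled string was not modified.
repaired = garbled.encode("latin-1").decode("cp1256")
```

This kind of repair only works when the wrong codec is lossless for the bytes involved; codecs with undefined byte values may already have dropped information.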
Encoding Issues

Arabic Input


Standard graphemic keyboard
Logical order input

[Figure: Arabic keyboard layout. Source: http://www.cyrillic.com/kbd/btc.html]



Encodings

Buckwalter Encoding

• Romanization

- One-to-one mapping to Arabic script spelling

- Left-to-right

- Easy to learn/use

- Human & machine compatible

• Commonly used in NLP

- Penn Arabic Treebank

• Some characters can be modified to allow use with XML and regular expressions

• Roman input/display

• Monolingual encoding (can’t do English and Arabic)

• Minimal support for extended Arabic characters


[Table: Buckwalter transliteration chart, pairing each Arabic letter and diacritic with a single ASCII character]
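The one-to-one property of the Buckwalter scheme means conversion in either direction is a plain character lookup. The following sketch builds the table from the standard Buckwalter symbols laid out in Unicode code point order; the function names are illustrative.

```python
def _block(start: int, symbols: str) -> dict:
    """Pair consecutive code points from `start` with the given ASCII symbols."""
    return {chr(start + i): s for i, s in enumerate(symbols)}


AR2BW = {}
# U+0621..U+063A: hamza forms, alef, beh ... ghain
AR2BW.update(_block(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"))
# U+0640..U+0652: tatweel, feh ... yeh, then nunation, vowels, shadda, sukun
AR2BW.update(_block(0x0640, "_fqklmnhwYyFNKaui~o"))
# Dagger alef and alef wasla
AR2BW.update({"\u0670": "`", "\u0671": "{"})

BW2AR = {v: k for k, v in AR2BW.items()}


def to_buckwalter(text: str) -> str:
    """Arabic script to ASCII; unknown characters pass through unchanged."""
    return "".join(AR2BW.get(c, c) for c in text)


def from_buckwalter(text: str) -> str:
    """ASCII back to Arabic script; unknown characters pass through unchanged."""
    return "".join(BW2AR.get(c, c) for c in text)


kitab = to_buckwalter("\u0643\u062A\u0627\u0628")  # "ktAb"
```

The "monolingual encoding" limitation is visible here: `from_buckwalter` would also convert the letters of any English word in the input, since every ASCII letter it knows is treated as Arabic.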



Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Morphology 


• Type 

- Concatenative: prefix, suffix, circumfix

- Templatic: root+pattern

• Function 

- Derivational 

• Creating new words 

• Mostly templatic 

- Inflectional 

• Modifying features of words 

- Tense, number, person, mood, aspect 

• Mostly concatenative 





Derivational Morphology 


• Templatic Morphology

Root: ك ت ب (k t b)

Pattern: ma _ _ ū _     Lexeme: مكتوب maktūb "written"

Pattern: _ ā _ i _      Lexeme: كاتب kātib "writer"

Lexeme.Meaning =

(Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random
Derivational Morphology

Root Meaning

• ك ت ب KTB = notion of “writing”

/katab/ write

/kātib/ writer

/kitāb/ book

/maktab/ office

/maktaba/ library

/maktūb/ written

/maktūb/ letter
Derivational Morphology

Root Meaning

• ل ح م LHM-1

• Notion of “meat”

- /laħm/

• Meat

- /laħħām/

• Butcher

Derivational Morphology

Root Meaning

• ل ح م LHM-2

• Notion of “battle”

- /malħama/

• Fierce battle

• Massacre

• Epic

Derivational Morphology

Root Meaning

• ل ح م LHM-3

• Notion of “soldering”

- /laħam/

• Weld, solder, stick, cling

- /iltaħam/

• Be welded/soldered/fused

- /multaħim/

• Welded, soldered, fused
Derivational Morphology

Pattern Meaning


• Verb Pattern Meaning is hard to define

Form   Pattern      Pattern Meaning               Example             Gloss
I      1a2a3        Basic sense of root           ktb -> katab        write
II     1a22a3       Intensification, causation    ktb -> kattab       dictate
III    1aA2a3       Interaction with others       ktb -> kaAtab       correspond with
IV     Aa12a3       Causation                     jls -> Ajlas        seat
V      ta1a22a3     Reflexive of Pattern II       Elm -> taEal~am     learn
VI     ta1aA2a3     Reflexive of Pattern III      ktb -> takaAtab     correspond
VII    Ain1a2a3     Passive of Pattern I          ktb -> Ainkatab     subscribe/enroll
VIII   Ai1ta2a3     Acquiescence, exaggeration    ktb -> Aiktatab     register
IX     Ai12a33      Transformation                Hmr -> AiHmarr      turn red/blush
X      Aista12a3    Requirement                   ktb -> Aistaktab    ask/make_write
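The pattern notation in the table, where the digits 1, 2 and 3 stand for the root radicals, lends itself to a simple substitution. The sketch below interdigitates a triliteral root with a pattern written in that notation; the function name is illustrative, and the sketch ignores the spelling adjustments a real generator would need.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the slots 1, 2, 3 of a pattern with the radicals of a root.

    Both arguments use the romanization of the table, e.g. root "ktb"
    and pattern "1a22a3". Only triliteral roots are handled.
    """
    if len(root) != 3:
        raise ValueError("this sketch only handles three-radical roots")
    return "".join(root[int(c) - 1] if c in "123" else c for c in pattern)


form_i = interdigitate("ktb", "1a2a3")        # katab
form_ii = interdigitate("ktb", "1a22a3")      # kattab
form_x = interdigitate("ktb", "Aista12a3")    # Aistaktab
form_ix = interdigitate("Hmr", "Ai12a33")     # AiHmarr
```

Substitution reproduces the form, not the meaning: as the slide notes, the sense of a derived verb is only loosely predictable from its pattern.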


Inflectional Morphology

Derivational Morphology

- Lexeme ≈ Root + Pattern

Inflectional Morphology

- Word = Lexeme + Features

Features

- Part-of-speech

• Traditional: Noun, Verb, Particle

• Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others

- Noun-specific 

• Number: singular, dual, plural, collective 

• Gender: masculine, feminine, Neutral 

• Definiteness: definite, indefinite 

• Case: nominative, accusative, genitive 

• Possessive clitic 
Inflectional Morphology


Features (continued) 

- Verb-specific 

• Aspect: perfective, imperfective, imperative 

• Voice: active, passive 

• Tense: past, present, future 

• Mood: indicative, subjunctive, jussive 

• Subject (Person, Number, Gender) 

• Object clitic 

- Others 

• Single-letter conjunctions 

• Single-letter prepositions 



Inflectional Morphology

Nouns

وكبيوتنا /wakabiyūtinā/
wa+ka+biyūt+nā
and+like+houses+our
And like our houses

وللمكتبات /walilmaktabāt/
wa+li+al+maktaba+āt
and+for+the+library+plural
And for the libraries


Morphotactics (e.g. ل + ال → لل)
Arabic Broken Plurals (templatic)


Inflectional Morphology

Verbs

Word template: conj + tense + verb + subj + object

فقلناها /faqulnāhā/
fa+qul+nā+hā
so+said+we+it
So we said it.

وسنقولها /wasanaqūluhā/
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it


Morphotactics

Subject conjugation (suffix or circumfix)
Inflectional Morphology


• Perfect verb subject conjugation (suffixes only)

       Singular    Dual         Plural
1      katabtu                  katabnā
2      katabta     katabtumā    katabtum
3      kataba      katabā       katabū

• Imperfect verb subject conjugation (prefix+suffix)

       Singular    Dual         Plural
1      aktubu                   naktubu
2      taktubu     taktubān     taktubūn
3      yaktubu     yaktubān     yaktubūn

Feminine form and other verb moods not shown
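The two tables differ in where the subject is marked: suffixes only in the perfect, prefix plus suffix in the imperfect. A small table-driven sketch makes that explicit. It covers only the masculine forms listed above; the stem strings and function names are assumptions for illustration.

```python
# (person, number) -> suffix, as in the perfect table
PERFECT = {
    (1, "sg"): "tu", (1, "pl"): "nā",
    (2, "sg"): "ta", (2, "du"): "tumā", (2, "pl"): "tum",
    (3, "sg"): "a", (3, "du"): "ā", (3, "pl"): "ū",
}

# (person, number) -> (prefix, suffix), as in the imperfect table
IMPERFECT = {
    (1, "sg"): ("a", "u"), (1, "pl"): ("na", "u"),
    (2, "sg"): ("ta", "u"), (2, "du"): ("ta", "ān"), (2, "pl"): ("ta", "ūn"),
    (3, "sg"): ("ya", "u"), (3, "du"): ("ya", "ān"), (3, "pl"): ("ya", "ūn"),
}


def perfect(stem: str, person: int, number: str) -> str:
    """Perfect conjugation: stem followed by a subject suffix."""
    return stem + PERFECT[(person, number)]


def imperfect(stem: str, person: int, number: str) -> str:
    """Imperfect conjugation: subject prefix, stem, subject suffix."""
    prefix, suffix = IMPERFECT[(person, number)]
    return prefix + stem + suffix


they_wrote = perfect("katab", 3, "pl")      # katabū
they_write = imperfect("ktub", 3, "pl")     # yaktubūn
```

The missing (1, "du") key mirrors the empty first-person dual cells in the tables.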




Morphological Ambiguity

Derivational ambiguity

- قاعدة: basis/principle/rule, military base, Qa'ida/Qaeda/Qaida

Inflectional ambiguity

- تكتب: you write, she writes

- Segmentation ambiguity

• وجد: he found; و+جد: and+grandfather

• للغة: ل+لغة: for a language; ل+اللغة: for the language

Spelling ambiguity

- Optional diacritics

• كاتب: /kātib/ writer, /kātab/ to correspond

- Suboptimal spelling

• Hamza dropping: أ, إ → ا

• Undotted ta-marbuta: ة → ه

• Undotted final ya: ي → ى
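One common way to cope with suboptimal spelling is to collapse the confusable letters before lookup, so that careful and careless spellings match. The sketch below is one such normalization under that assumption; it is not a method prescribed by the tutorial, and it deliberately trades extra ambiguity for robustness.

```python
import re

# Alef with madda, hamza above or hamza below -> bare alef
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_spelling(text: str) -> str:
    """Collapse letters that writers frequently interchange."""
    text = ALEF_VARIANTS.sub("\u0627", text)   # hamza dropping
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    return text


# alef-with-hamza-below + lam + alef maqsura
ila = normalize_spelling("\u0625\u0644\u0649")
# meem kaf teh beh ta-marbuta
maktaba = normalize_spelling("\u0645\u0643\u062A\u0628\u0629")
```

Both the well-formed and the careless spelling of a word map to the same string after this step, at the cost of merging some genuinely different words.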



Morphological Ambiguity

• Multiple sources of ambiguity: the undiacritized word بين

- /bayyana/ Verb: he declared/demonstrated

- /bayyanna/ Verb: they [feminine] declared/demonstrated

- /bayyin/ Adj: clear/evident/explicit

- /bayna/ Prep: between/among

- /biyin/ Proper Noun: in Yen

- /biyn/ Proper Noun: Ben

• Hard to measure specific causes of ambiguity

- Derivational ambiguity* (diacritized tokens)

• 1.09 entries/token

• 1.01 entries/token (within same part-of-speech)

- Spelling ambiguity* (undiacritized tokens)

• 1.28 entries/token

• 1.08 entries/token (within same part-of-speech)

* in Buckwalter’s Lexicon (~40,000 lexemes)



Morphological Ambiguity

Average overall ambiguity* is 2.5 analyses/word

• Compare to English ENGTWOL ambiguity (1.7-2.2 analyses/word)

[Figure: histogram over analyses/word (1, 2, 3, 4, 5, 6, 7, 8 or more); vertical axis from 0% to 40%]

* In Arabic Penn Treebank 1


Arabic Computational Morphology

• Representation units

• Natural token: وللمكتـــبات

- White space separated strings (as is)

- Can include extra characters (e.g. tatweel/kashida)

• Word: وللمكتبات

• Segmented word

- Can include any degree of morphological analysis

- Pure segmentation (example lost in extraction)

- Arabic Treebank tokens (with recovery of some deleted/modified letters): و + ل + المكتبات

Arabic Computational Morphology

• Representation units (continued)

• Prefix + Stem + Suffix

- ولل + مكتب + ات

- Can create more ambiguity

• Lexeme + Features

- مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features

- كتب + ma12a3a + [+Plural +Def +ل +و]

- Very abstract

• Root + Pattern + Vocalism + Features

- كتب + m123 + a.a.a + [+Plural +Def +ل +و]

- Very very abstract
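The remark that a Prefix + Stem + Suffix representation "can create more ambiguity" can be seen by enumerating every split of a word that a lexicon allows. The sketch below uses a tiny hypothetical lexicon in Buckwalter transliteration; the three sets are invented for illustration, and a real concatenative analyzer would also check which prefix, stem and suffix categories are compatible.

```python
# Hypothetical toy lexicon (Buckwalter transliteration), for illustration only.
PREFIXES = {"", "w", "l", "wl", "Al", "wAl", "ll", "wll"}
STEMS = {"wjd", "jd", "mktb", "ktb"}
SUFFIXES = {"", "At", "h"}


def analyses(word: str) -> list:
    """Return every (prefix, stem, suffix) split licensed by the toy lexicon."""
    found = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                found.append((prefix, stem, suffix))
    return found


# "wjd": either the single stem "he found" or w+jd "and+grandfather"
found_or_grandfather = analyses("wjd")
# "wllmktbAt": and+for+the + library + plural
for_the_libraries = analyses("wllmktbAt")
```

Every additional prefix or suffix entry increases the number of candidate splits, which is why disambiguation usually follows analysis.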
Arabic Computational Morphology


Approaches

- Finite state machines (Beesley, 2001) (Kiraz, 2001) (Habash et al., 2005b)

- Concatenative analysis/generation (Buckwalter, 2002) (Cavalli-Sforza et al., 2000)

- Lexeme+Feature analysis/generation (Habash, 2004)

- Shallow stemming (Darwish, 2002) (Aljlayl and Frieder, 2002)

- Machine learning (Diab et al., 2004) (Lee et al., 2003) (Rogati et al., 2003) (Habash & Rambow, 2005a)

Issues 

- Appropriateness of system representation for an application 

• Machine Translation vs. Information Retrieval 

• Arabic spelling vs. phonetic spelling 

- System coverage 

- System extendibility 

- Availability to researchers 

- Use for analysis and generation 



Road Map 


Introduction 

Orthography 

Morphology 

Syntax 

- Morphology and Syntax 

- Sentence Structure 

- Phrase Structure 

- Computational Resources 

Machine Translation Issues 
Dialects 


Morphology and Syntax 

• Rich morphology crosses into syntax 

- Pro-drop / Subject conjugation 

- Verb subcategorization and object clitics 

• Verb_transitive+subject+object

• Verb_intransitive+subject but not Verb_intransitive+subject+object

• Verb_passive+subject but not Verb_passive+subject+object

• Morphological interactions with syntax 

- Agreement 

• Full: e.g. Noun-Adjective on number, gender, and definiteness 

• Partial: e.g. Verb-Subject on gender (in VSO order) 

- Definiteness 

• Noun compound formation, copular sentences, etc. 

• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc. 
Morphology and Syntax

• Morphological interactions with syntax (continued)

- Case

• MSA is case marking: nominative, accusative, genitive

• Almost-free word order

• Case is often marked with optionally written short vowels

- This effectively limits the word-order freedom in published text

• Agglutination

- Attached prepositions create words that cross phrase boundaries

للمكتبات li+Almaktabat

for the-libraries [PP li [NP Almaktabat]]

• Some morphological analysis (minimally segmentation) is necessary even for statistical approaches to parsing


Sentence Structure 


Two types of Arabic Sentences 

• Verbal sentences 

- [Verb Subject Object] (VSO) 

Wrote the-boys the-poems 
The boys wrote the poems 

• Copular sentences 

- [Topic Complement] 

the-boys poets 
The boys are poets 
Sentence Structure


Verbal sentences

- Verb agreement with gender only

• wrote_3MascSing the-boy/the-boys

• wrote_3FemSing the-girl/the-girls

- Pronominal subjects are conjugated

• wrote-you_MascSing

• wrote-you_MascPlur

• wrote-they_MascPlur

- Passive verbs

• Same structure: Verb_passive Subject_underlyingObject

• Agreement with surface subject


Sentence Structure 


Verbal sentences 


- Common structural ambiguity

• Third masculine/feminine singular are structurally ambiguous

- Verb_3MascSingular Noun_Masc

Verb subject=he object=Noun
Verb subject=Noun


• Passive and active forms are often similar in standard orthography

- كتب /kataba/ he wrote

- كتب /kutiba/ it was written



Sentence Structure 


Copular sentences 

- [Topic Complement] 

Definite Topic, Indefinite Complement 

the-boy poet 
The boy is a poet 

- [Auxiliary Topic Complement] 

Auxiliaries ( kana and her sisters) 

• Tense, Negation, Transformation, Persistence 

• was the-boy poet: The boy was a poet

• is-not the-boy poet: The boy is not a poet

- Inverted order is expected in certain cases

• Indefinite topic

/ʕindī kitābun/ at-me a-book: I have a book


Sentence Structure


• Copular sentences

- Types of complements

• Noun/Adjective/Adverb

- the-boy smart: The boy is smart

• Prepositional Phrase

- the-boy in the-library: The boy is in the library

• Copular-Sentence

- [the-boy [book-his big]]: The boy, his book is big

• Verb-Sentence

- [the-boys [wrote-they poems]]: The boys wrote the poems

- Full agreement in this order (SVO)

- [the-poems [wrote-it the boys]]: The poems, the boys wrote


Phrase Structure 


Noun Phrase

- Determiner Noun Adjective PostModifier

this the-writer the-ambitious the-arriving from Japan
This ambitious writer from Japan

- Noun-Adjective agreement

• number, gender, definiteness

- the-writer_fem the-ambitious_fem

- the-writer_femPlur the-ambitious_femPlur



Phrase Structure 


• Noun Phrase

- Idafa construction

• Noun1 of Noun2 encoded structurally

• Noun1-indefinite Noun2-definite

king Jordan

the king of Jordan / Jordan’s king

- Noun1 becomes definite

• Agrees with definite adjectives

- Idafa chains

• N1 N2 ... Nn-1 Nn

indef indef ... indef def

son uncle neighbor chief committee management the-company

The cousin of the CEO’s neighbor
Phrase Structure


• Morphological definiteness interacts with syntactic structure

                        Word 1 (writer): definite      Word 1 (writer): indefinite

Word 2: definite        Noun Phrase                    Noun Compound
                        الكاتب الفنان                  كاتب الفنان
                        The artist(ic) writer          The writer of the artist

Word 2: indefinite      Copular Sentence               Noun Phrase
                        الكاتب فنان                    كاتب فنان
                        The writer is an artist        An artist(ic) writer


Computational Resources


Monolingual corpora for building language models 

- Arabic Gigaword 

• Agence France Presse 

• AlHayat News Agency 

• AnNahar News Agency 

• Xinhua News Agency 

- Arabic Newswire 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

Distributors 

- Linguistic Data Consortium (LDC) 

- Evaluations and Language resources Distribution Agency 
(ELDA) 



Computational Resources 


• Penn Arabic Treebank (PATB)

- Started in 2001

- Goal is 1 million words

- Currently 650K words

• Agence France Presse, AlHayat newspaper, AnNahar newspaper

• POS tags

- Buckwalter analyzer

- Arabic-tailored POS list

• PATB constituency representation

[Figure: sample PATB constituency tree]

- Some modifications of Penn English Treebank

• (e.g. Verb-phrase internal subjects)
Computational Resources


• Prague Dependency Treebank 


• Currently 100k words 

• Partial overlap with PATB 
and Arabic Gigaword 

- Agence France Presse, 
AlHayat and Xinhua 

• Morphological analysis 

- Similar to PATB 


[Figure: example dependency tree]

- Pred: ya+SoEad ((he) gets on)
- Sb: -huwa (he)
- Obj: Al+bAS (the bus)
- AuxY: wa- (and)


• Dependency representation 


Graphic courtesy of Otakar Smrz: http://ckl.mff.cuni.cz/padt/PADT 1.0/docs/slides/2003-eacl-trees.ppt 


Computational Resources 


• Applications using Penn Arabic Treebank 

- Statistical parsing

• Bikel’s parser (Bikel 2003) 

- Same engine used with English, Chinese and Arabic 

- POS tagging and morphological disambiguation 

• (Diab et al., 2004) and (Habash and Rambow, 2005a)

• Arabic POS tagging (Khoja, 2001)

• Formalism conversion 


- Constituency to dependency (Zabokrtsky and Smrz 2003)

- Tree-adjoining grammar extraction (Habash and Rambow 2004)

• Automatic diacritization 
Road Map


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

- Morphology and Translation 
- Translation Divergences

- Computational Resources 

• Dialects 


Morphology and Translation 
which level to go down to? 

• Natural token ol 

• Word وللمكتبات

• Segmented Word و ل المكتبات

• Prefix + Stem + Suffix و لل+مكتب+ات

• Lexeme + Features مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features

كتب + ma12a3a + [+Plural +Def +ل +و]
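
To make the levels concrete, here is a minimal sketch that renders one hand-written analysis at each level. Everything in it is an assumption for illustration: the example word is taken to be Buckwalter-transliterated wllmktbAt ("and for the libraries"), the `Analysis` fields, the segmentation conventions and the pattern notation `ma12a3ap` are hypothetical, and no real analyzer is being called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Analysis:
    surface: str            # whitespace-delimited word
    proclitics: List[str]   # separable clitics, in order
    base: str               # word left after detaching the proclitics
    prefix: str
    stem: str
    suffix: str
    lexeme: str
    root: str
    pattern: str
    features: List[str]


def levels(a):
    """Render one analysis at each representation level listed on the slide."""
    feats = "[" + " ".join(a.features) + "]"
    affixed = "+".join(p for p in (a.prefix, a.stem, a.suffix) if p)
    return {
        "word": a.surface,
        "segmented word": " ".join(a.proclitics + [a.base]),
        "prefix+stem+suffix": " ".join(a.proclitics + [affixed]),
        "lexeme+features": f"{a.lexeme} {feats}",
        "root+pattern+features": f"{a.root} + {a.pattern} {feats}",
    }


# Hand-specified analysis (assumed values, Buckwalter transliteration)
example = Analysis(
    surface="wllmktbAt", proclitics=["w", "l"], base="AlmktbAt",
    prefix="Al", stem="mktb", suffix="At",
    lexeme="mktbp", root="ktb", pattern="ma12a3ap",
    features=["+Plural", "+Def", "+l", "+w"],
)

for name, form in levels(example).items():
    print(f"{name:24s} {form}")
```

Each step down the list moves information out of the token string and into explicit symbols, which is what makes the deeper levels sparser in vocabulary but more dependent on disambiguation.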
Morphology and Translation
What approach?

• Natural token: Not Appropriate

• Word: Statistical MT

• Segmented Word: Statistical MT

• Prefix + Stem + Suffix: Statistical/Symbolic

• Lexeme + Features: Symbolic MT

• Root + Pattern + Features: Too Abstract?
Morphology and Translation

What resources? 

• Available resources may span different levels of 
representation! 

• Most dictionaries are lexeme-based 

• Buckwalter stem dictionary contains English glosses 

• Statistical translation lexicons depend on the type of 
tokenization used before alignment 

- Word (no disambiguation necessary) 

- Segmented word (minimal disambiguation necessary) 

- Stem/Lexeme (machine/human disambiguation necessary) 

• Consistency is important 

Translation Divergences 

• Beyond word-order variation 

- Arabic VSO - English SVO 

- Arabic N Adj - English Adj N 

• Meaning of two translationally equivalent constituents is 
distributed differently in two languages 

• Divergence dimensions 

- Categorial Variation (develop -> development)

- Conflation (become frozen -> freeze) 

- Inflation (freeze -> become frozen) 

- Structural (enter the room -> enter into the room) 

- Head Swap (swim across the river -> cross the river swimming) 

- Thematic (John likes Mary -> Mary pleases John) 
Translation Divergences

conflation

I have a book

at-me book
Translation Divergences

conflation

I am not here

I-am-not here


Translation Divergences 

structural 



book Nizar



Nizar’s book 
Book of Nizar 
Translation Divergences

structural

I found the book

found-I upon the-book
Translation Divergences

thematic & conflational

head-my hurts-me

my head hurts

I have a headache
Translation Divergences

head swap and categorial 



I swam across the river quickly 

I-sped crossing the-river swimming 

Computational Resources 

Dictionaries 

- Buckwalter stem dictionary (LDC) 

- Salmone dictionary (Tufts University)

- Online dictionaries - Ajeeb.com (Sakhr), Almisbar.com, 
Ectaco.com 

Parallel corpora (LDC) 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

- Arabic News Translation Corpus 

- Arabic Treebank English Translation 

- More on the LDC webpage ...

MT evaluation 

- Arabic-English Multi-translation Corpus (LDC) 

- NIST’s MT-EVAL 

• Statistical MT systems are the state of the art
Road Map

Introduction

Orthography 

Morphology 

Syntax 

Machine Translation Issues 
Dialects 

- General Definitions 

- Phonological & Lexical Variation 

- Morphological Variation 

- Syntactic Variation 

- Code Switching 

- Computational Resources 



lam jaʃtari nizar tawilatan ʒadīdatan

didn’t buy Nizar table new

nizar maʃtaraʃ tarabeza gidīda
nizar maʃtaraʃ tawile ʒdīde
nizar maʃraʃ mida ʒdīda

Nizar not-bought-not table new
General Definitions


What is a ‘dialect’? 

- Political and Religious factors 

Modern Standard Arabic 

Regional Dialects 

- Egyptian Arabic (EGY) 

- Levantine Arabic (LEV) 

- Gulf Arabic (GULF) 

- North African Arabic (NOR) 

- Iraqi, Yemenite, Sudanese, Maltese? 

Social dialects 

- City 

- Peasant 

- Bedouin 



General Definitions 


• Diglossia 

• Badawi’s levels 


- Traditional Arabic 

- Modern Arabic


- Educated Colloquial 

- Literate Colloquial 

- Illiterate Colloquial 

• Polyglossia

[Figure labels: Classical, Dialect, Foreign]

Phonological Variation 

[Figure: Arabic letters with their MSA and LEV pronunciations]

• No dialect-specific standard orthography 


Lexical Variation 

• Arabic Dialects vary widely lexically 


English       MSA         Moroccan    Egyptian    Syrian    Iraqi
table         Tawila      mida        Tarabeza    Tawle     mez
cat           qiTTa       qeTTa       ’oTTa       bisse     bazzuna
of            idafa       dyal        bita3       taba3     mal
(I) want      ’uridu      bgit        3awez       biddi     ’arid
there is      yujadu      kayn        fi          fi        aku
there isn't   la yujadu   ma kaynš    mafiš       ma fi     maku

• Arabic orthography allows consolidating some variations
Morphological Variation

Nouns 

- No case marking 

• Word order implications 

- Paradigm reduction 

• Consolidating masculine & feminine plural 

Verbs 

- Paradigm reduction 

• Loss of dual forms 

• Consolidating masculine & feminine plural (2nd, 3rd person)

• Loss of morphological moods 

- Subjunctive/jussive form dominates in some dialects 

- Indicative form dominates in others 

- Other aspects increase in complexity 



Morphological Variation 

Verb Morphology 



MSA

ولم تكتبوها له

walam taktubuha lahu
wa+lam taktubu+ha la+hu

and+not_past write_you+it for+him


EGY

wimakatabtuhaluʃ
wi+ma+katab+tu+ha+lu+ʃ

and+not+wrote+you+it+for_him+not


And you didn’t write it for him 
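
The contrast between the two analyses can be checked mechanically by pairing each morpheme with its gloss. The sketch below is illustrative only: `align` is a hypothetical helper, and the last two Egyptian morphemes are read as lu+ʃ, which is an assumption about the slide's transcription.

```python
def align(segmented, gloss):
    """Pair each '+'-separated morpheme with its gloss, word by word."""
    morphs = [m for word in segmented.split() for m in word.split("+")]
    glosses = [g for word in gloss.split() for g in word.split("+")]
    if len(morphs) != len(glosses):
        raise ValueError("morpheme and gloss counts differ")
    return list(zip(morphs, glosses))


msa_seg = "wa+lam taktubu+ha la+hu"
egy_seg = "wi+ma+katab+tu+ha+lu+ʃ"  # final morphemes are an assumed reading

msa = align(msa_seg, "and+not_past write_you+it for+him")
egy = align(egy_seg, "and+not+wrote+you+it+for_him+not")

# Three MSA words versus a single Egyptian word for the same sentence
print(len(msa_seg.split()), "words,", len(msa), "morphemes")
print(len(egy_seg.split()), "word,", len(egy), "morphemes")
for morph, meaning in egy:
    print(f"{morph:6s} {meaning}")
```

In the Egyptian form the negation is split around the verb (ma ... ʃ), so both the second and the last morpheme gloss as "not".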
Morphological Variation

Verb conjugation

• Perfect verb derivation (suffixes only)

       1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.)
MSA    katabtu               katabta                       katabti
LEV    katabt                katabt                        katabti

• Imperfect verb derivation (prefix+suffix)

       1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.)
MSA    aktubu                taktubu                       taktubīna / taktubī
LEV    aktob                 toktob                        toktobi
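
The paradigms on this slide can be written as small affix tables. The following is a sketch under assumptions: the stems and affixes are read off the forms shown, only the indicative MSA imperfect is covered, and the table and function names (`PERFECT_SUFFIXES`, `IMPERFECT_AFFIXES`, `STEMS`, `conjugate`) are hypothetical.

```python
# Suffixes for the perfect (suffixes only)
PERFECT_SUFFIXES = {
    "MSA": {"1s": "tu", "2ms": "ta", "2fs": "ti"},
    "LEV": {"1s": "t", "2ms": "t", "2fs": "ti"},
}

# (prefix, suffix) pairs for the imperfect
IMPERFECT_AFFIXES = {
    "MSA": {"1s": ("a", "u"), "2ms": ("ta", "u"), "2fs": ("ta", "īna")},
    "LEV": {"1s": ("a", ""), "2ms": ("to", ""), "2fs": ("to", "i")},
}

STEMS = {
    "MSA": {"perfect": "katab", "imperfect": "ktub"},
    "LEV": {"perfect": "katab", "imperfect": "ktob"},
}


def conjugate(variety, aspect, person):
    """Build a verb form for 'write' from the affix tables above."""
    stem = STEMS[variety][aspect]
    if aspect == "perfect":
        return stem + PERFECT_SUFFIXES[variety][person]
    prefix, suffix = IMPERFECT_AFFIXES[variety][person]
    return prefix + stem + suffix


for variety in ("MSA", "LEV"):
    for aspect in ("perfect", "imperfect"):
        forms = [conjugate(variety, aspect, p) for p in ("1s", "2ms", "2fs")]
        print(variety, aspect, forms)
```

The paradigm reduction shows up directly in the tables: the Levantine perfect has the same suffix for 1st person singular and 2nd person masculine singular.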


Morphological Variation 


Tense expression 



MSA

- Perfect: kataba (Past)
- Imperfect: jaktubu (Present)
- Imperfect: sajaktubu (Future)

LEV

- Perfect: katab (Past)
- Imperfect: jiktob (0-Tense)
- Imperfect: bjoktob (Present habitual)
- Imperfect: ʕam bjoktob (Present progressive)
- Imperfect: ħajiktob (Future)
Syntactic Variation

• Verbal sentences 

- The children wrote poems 

- MSA 

• Verb Subject Object (Partial agreement) 

wrote masc the-boys the-poems 

• Subject Verb Object (Full agreement) 

the-boys wrote mascPlural the-poems 

- LEV, EGY 

• Subject Verb Object 

The-boys wrote mascPlural the-poems 

• Less common: Verb Subject Object

wrote mascPlural the-boys the-poems

• Full agreement in both orders
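
The agreement pattern on this slide reduces to one rule: only MSA verb-initial sentences drop number agreement. A minimal sketch follows; the function name `verb_agreement` and the feature dictionaries are hypothetical, and person agreement is left out for brevity.

```python
def verb_agreement(variety, order, subject):
    """Return the agreement features the verb carries.

    variety: "MSA", "LEV" or "EGY"
    order:   "VSO" or "SVO"
    subject: dict with "gender" and "number" of the subject
    """
    if order not in ("VSO", "SVO"):
        raise ValueError("order must be 'VSO' or 'SVO'")
    agreement = {"gender": subject["gender"], "number": subject["number"]}
    if variety == "MSA" and order == "VSO":
        # Partial agreement: gender only, the verb stays singular
        agreement["number"] = "singular"
    return agreement


boys = {"gender": "masc", "number": "plural"}
for variety in ("MSA", "LEV", "EGY"):
    for order in ("VSO", "SVO"):
        print(variety, order, verb_agreement(variety, order, boys))
```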
Syntactic Variation

Noun Phrase

- Idafa construction

• Noun1 of Noun2 encoded structurally

king Jordan

the king of Jordan /Jordan’s king

- Dialects have an additional common construct

• Noun1 <particle> Noun2

• LEV: the-king belonging-to Jordan

• <particle> differs widely among dialects

- Pre/post-modifying demonstrative article

• MSA: this the-man (this man)

• EGY: the-man this (this man)
Code Switching


MSA and Dialect mixing in speech 

• phonology, morphology and syntax 


[Figure: transcript excerpt with MSA and LEV segments marked]

Aljazeera Transcript http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm
Computational Resources

• Most work on Arabic dialects focuses on Automatic 
Speech Recognition 

• Speech/transcript corpora 

- Egyptian and Levantine Arabic (LDC) 

- Moroccan and Tunisian Arabic (ELDA) 

- Gulf Arabic (Appen) 

- Many others ...

• Few lexicons/morphology resources 

- CallHome Egyptian Arabic monolingual lexicon (LDC) 

- CallHome Egyptian Verb transducer (LDC) 

• Work on multi-dialectal resources

- Linguistic Data Consortium 

- Columbia University Arabic Dialect Project 

• Pan-Arab lexicon and Pan-Arab Morphology 

• Parsing Arabic Dialects (JHU summer workshop 2005)



Resources 


Distributors 

• Linguistic Data Consortium 

• NEMLAR (Network for Euro-Mediterranean LAnguage Resources)

• ELSNET (European Network of Excellence in Human Language Technologies)

• ELDA (Evaluations and Language resources Distribution Agency)
Resources


Reports

• Mohamed Maamouri and Christopher Cieri. 2002. Resources for Natural Language Processing at the Linguistic Data Consortium. In Proceedings of the International Symposium on Processing of Arabic, pages 125-146, Manouba, Tunisia, April 2002.

• Mahtab Nikkhou and Khalid Choukri. Survey on Arabic Language Resources and Tools in the Mediterranean Countries.

• Arabic Information Retrieval and Computational Linguistics Resources (thanks to Doug Oard)
Resources


Monolingual Corpora 

• Arabic Gigaword

• Arabic Newswire

Parallel Corpora 

• United Nations Parallel Corpus 

• Ummah Parallel Corpus 

• Arabic News Translation 

• Multiple-Translation Arabic 

Treebanks 

• Arabic Penn Treebank Webpage 

- Part 1 v2.0, Part 2 v2.0, Part 3 v1.0, 10K-word English Translation

• Prague Arabic Dependency Treebank 
Resources


Morphology 

• Buckwalter Arabic Morphological Analyzer 

- Version 1.0, Version 2.0

• Xerox Arabic Morphology (online) 

Dialect Resources 

• CALLHOME Egyptian Arabic Transcripts 

• CALLHOME Egyptian Arabic Speech 

• Egyptian Colloquial Arabic Lexicon 

• Levantine Arabic Resources 

• http://www.orientel.org/ 

• http://www.appen.com.au 
Resources


Dictionaries

• Buckwalter Stem Dictionary

• H. Anthony Salmone. An Advanced Learner's Arabic-English Dictionary, encoded by the Perseus Project, Tufts University (contact: David Smith dasmith@perseus.tufts.edu)

• Ajeeb Arabic-English Dictionary (online)

• Al-Misbar Dictionary (online)

• Ectaco Bilingual Dictionary (online)

Online MT systems

• Ajeeb's Arabic-English Machine Translation (online)

• Al-Misbar English-Arabic Machine Translation (online)
Conferences and Workshops

with some focus on Arabic


• ACL 2005 Workshop on Computational Approaches to Semitic Languages

• Arabic Language Resources and Tools Conference 2004, Cairo, Egypt

• Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004)

• Traitement Automatique du Langage Naturel (TALN '04)

• NIST MT EVAL ( http://www.nist.gov/speech/tests/mt/ )

• MT Summit IX Workshop on Machine Translation for Semitic Languages in 2003

• LREC 2002 Arabic Language Resources and Evaluation Workshop

• ACL 2002 Workshop on Computational Approaches to Semitic Languages

• International Symposium on Processing of Arabic 2002, Tunisia

• Workshop on Arabic Language Processing: Status and Prospects (ACL/EACL 2001)

• Arabic Translation and Localisation Symposium (ATLAS 1999)

• Computational Approaches to Semitic Languages (COLING/ACL 1998)